This worksheet will take a look at some basic statistical analyses that can be done using R.
We will play around with some data. Read in the phenotype data that we’re going to be using for the practicals
phen <- read.table("../data/genetic/phen.txt", header=TRUE, stringsAsFactors=FALSE)
What does the data look like?
dim(phen)
## [1] 8237 7
head(phen)
## FID IID BMI DBP SBP CRP HT
## 1 id1 id1 31.86647 87.23318 152.2902 1.4446800 2
## 2 id2 id2 35.42533 86.05953 143.0827 3.7545641 2
## 3 id3 id3 29.77809 98.74657 234.1238 0.5148577 2
## 4 id4 id4 34.79715 102.91224 198.5010 4.9365146 2
## 5 id5 id5 26.92786 77.81723 151.0387 3.3885281 2
## 6 id6 id6 27.57659 67.16845 114.8678 1.5661815 1
What kind of data is ‘phen’?
class(phen)
## [1] "data.frame"
Data frames are tables in which each column can be a different type of data. Let’s look at diastolic blood pressure (DBP). We can extract just the DBP column like this:
phen$DBP
This prints a lot of output to the console. We can use the ‘head’ function to avoid this; now it will only print the first 6 values:
head(phen$DBP)
## [1] 87.23318 86.05953 98.74657 102.91224 77.81723 67.16845
What kind of data is this column?
class(phen$DBP)
## [1] "numeric"
What kind of data is the FID column?
class(phen$FID)
## [1] "character"
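Rather than checking columns one at a time, `sapply` can apply `class` to every column at once. A minimal sketch with a toy data frame standing in for phen:

```r
# Toy data frame standing in for phen (mixed column types)
toy <- data.frame(
  FID = c("id1", "id2"),
  DBP = c(87.2, 86.1),
  HT  = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Apply class() to each column; returns a named character vector
sapply(toy, class)
```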
So we see that there are different data types in the ‘phen’ data.frame. What does the DBP data look like?
summary(phen$DBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.41 75.32 83.22 83.11 91.26 130.90
Let’s plot the data
plot(phen$DBP)
This just plots the values in the order that it finds them. A histogram might be more useful:
hist(phen$DBP)
We can get finer resolution
hist(phen$DBP, breaks=100)
What is the mean value of phen$DBP?
mean(phen$DBP)
## [1] 83.10646
What is the standard deviation?
sd(phen$DBP)
## [1] 11.76748
Is this normally distributed? Let’s compare it to the expected normal distribution curve for the same mean and standard deviation
hist(phen$DBP, prob=TRUE, density=20, main="Histogram of DBP and expected normal curve", xlab="DBP")
curve(
dnorm(x, mean=mean(phen$DBP), sd=sd(phen$DBP)),
col="darkblue",
lwd=2,
add=TRUE,
yaxt="n"
)
The standard error of the mean is calculated as the standard deviation divided by the square root of the sample size:
sd(phen$DBP) / sqrt(length(phen$DBP))
## [1] 0.1296579
We could use R to calculate an empirical standard error.
To sample values with replacement, we can use the sample command, for example here is a vector of 1 to 10:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
Now we sample it with replacement:
sample(1:10, replace=TRUE)
## [1] 3 9 9 9 10 1 10 6 9 8
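Because `sample` draws at random, the numbers you get will differ from those shown. If you need a reproducible draw, fix the random seed first — a minimal sketch:

```r
set.seed(42)                       # fix the random number generator state
a <- sample(1:10, replace = TRUE)

set.seed(42)                       # resetting the seed reproduces the same draw
b <- sample(1:10, replace = TRUE)

identical(a, b)                    # TRUE
```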
Notice that sometimes the same number occurs more than once. We can do the same using the DBP phenotype.
DBP_sample <- sample(phen$DBP, replace=TRUE)
hist(DBP_sample)
Notice the mean is slightly different from the original values
mean(DBP_sample)
## [1] 83.09317
mean(phen$DBP)
## [1] 83.10646
Let’s do this 10 times
nsample <- 10
sample_means <- array(0, nsample)
for(i in 1:nsample)
{
DBP_sample <- sample(phen$DBP, replace=TRUE)
sample_means[i] <- mean(DBP_sample)
}
What do our sampled means look like? Plot a histogram of all the sample means
hist(sample_means)
Compare the standard deviation of the sample means to the standard error
sd(sample_means)
## [1] 0.1936376
sd(phen$DBP) / sqrt(length(phen$DBP))
## [1] 0.1296579
The standard error of the mean relates to what would happen if we repeated the same experiment many times. This is the logic behind ‘frequentist’ statistics.
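With only 10 resamples the bootstrap estimate is noisy; with many more it settles down towards the analytical standard error. A sketch using simulated values as a stand-in for phen$DBP, and `replicate` as a compact alternative to the for loop:

```r
set.seed(1)
# Simulated stand-in for phen$DBP (mean and sd chosen to roughly match the data)
dbp <- rnorm(8237, mean = 83.1, sd = 11.8)

# 1000 bootstrap resamples, taking the mean of each
boot_means <- replicate(1000, mean(sample(dbp, replace = TRUE)))

empirical_se  <- sd(boot_means)
analytical_se <- sd(dbp) / sqrt(length(dbp))
c(empirical = empirical_se, analytical = analytical_se)
```

The two numbers should agree to within a few percent, and get closer still as the number of resamples grows.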
Is the mean value of BMI different between people with and without hypertension? Let’s look at the hypertension variable
summary(phen$HT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 2.00 1.76 2.00 2.00
table(phen$HT)
##
## 1 2
## 1978 6259
How does BMI differ between cases and controls? Let’s make a boxplot
boxplot(BMI ~ HT, phen)
Looks like we have an outlier in BMI. Let’s remove outliers using the rule: if a value is more than 5 standard deviations above or below the mean then remove it.
index1 <- phen$BMI > mean(phen$BMI) + 5 * sd(phen$BMI)
index2 <- phen$BMI < mean(phen$BMI) - 5 * sd(phen$BMI)
How many outliers do we have?
table(index1)
## index1
## FALSE TRUE
## 8236 1
table(index2)
## index2
## FALSE
## 8237
Remove anyone who is in index1 or index2
phen$BMI[index1 | index2] <- NA
NA is a special value in R that denotes a missing value. Let’s calculate the mean of BMI:
mean(phen$BMI)
## [1] NA
Problem - There are now missing values in the data. To account for this:
mean(phen$BMI, na.rm=TRUE)
## [1] 28.3236
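Most summary functions in R behave the same way: a single NA makes the whole result NA unless you explicitly drop missing values. A toy example:

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA - missing values propagate
mean(x, na.rm = TRUE)  # drop the NA first: (1 + 2 + 4) / 3
sum(is.na(x))          # count the missing values: 1
```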
What does the BMI distribution look like?
hist(phen$BMI, prob=TRUE, density=20, breaks=100, main="Histogram of BMI and expected normal curve", xlab="BMI")
curve(
dnorm(x, mean=mean(phen$BMI, na.rm=TRUE), sd=sd(phen$BMI, na.rm=TRUE)),
col="darkblue",
lwd=2,
add=TRUE,
yaxt="n"
)
There is some slight skewness to the data. But it’s not too bad. You can tell this is simulated data because you wouldn’t expect to see BMI values below 10!
Let’s look at the boxplot again
boxplot(BMI ~ HT, phen)
This part
BMI ~ HT
is known as a ‘formula’ in R. It says that BMI is modelled as a response of HT. In terms of the boxplot, it makes a separate box for each value of HT. It looks like there is a difference in means. How much precision is there in these estimates?
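As an aside, formulas are ordinary R objects: nothing is evaluated when one is written down, it simply records the model structure for functions like boxplot and lm to interpret. A small sketch:

```r
f <- BMI ~ HT   # just records the structure; BMI and HT need not exist yet

class(f)        # "formula"
all.vars(f)     # the variable names involved: "BMI" "HT"
```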
Calculate the mean and standard error of the BMI amongst individuals without HT
m1 <- mean(phen$BMI[phen$HT == 1], na.rm=TRUE)
se1 <- sd(phen$BMI[phen$HT == 1], na.rm=TRUE) / sqrt(sum(phen$HT == 1 & !is.na(phen$BMI)))
Do the same for individuals who do have HT
m2 <- mean(phen$BMI[phen$HT == 2], na.rm=TRUE)
se2 <- sd(phen$BMI[phen$HT == 2], na.rm=TRUE) / sqrt(sum(phen$HT == 2 & !is.na(phen$BMI)))
A T-test can be conducted with this information. Calculate the T value:
t_value <- abs(m1 - m2) / sqrt(se1^2 + se2^2)
Get the P-value from the T value (strictly, a two-sided test would double this):
pt(t_value, df=nrow(phen), lower.tail=FALSE)
## [1] 3.837127e-106
Note - the precise degrees of freedom actually come from a more complicated formula. We just use the sample size for simplicity.
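For reference, that more complicated formula is the Welch-Satterthwaite approximation, which combines the two group standard errors. A sketch with illustrative group summaries (these se and n values are made up, not taken from phen):

```r
# Illustrative group standard errors and sample sizes (made up)
se1 <- 0.10; n1 <- 2000
se2 <- 0.06; n2 <- 6000

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (se1^2 + se2^2)^2 /
  (se1^4 / (n1 - 1) + se2^4 / (n2 - 1))
df_welch
```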
A quicker way is to use R’s built in t.test function
t.test(phen$BMI[phen$HT == 1], phen$BMI[phen$HT == 2])
##
## Welch Two Sample t-test
##
## data: phen$BMI[phen$HT == 1] and phen$BMI[phen$HT == 2]
## t = -22.172, df = 3667.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.182405 -2.665302
## sample estimates:
## mean of x mean of y
## 26.10195 29.02581
Let’s look at the C-reactive protein (CRP) measures
hist(phen$CRP, breaks=100)
This is heavily skewed, and it looks like there are some outliers. Let’s log-transform the variable:
phen$logCRP <- log(phen$CRP)
hist(phen$logCRP)
This looks much better. Let’s remove any outliers as we did before, this time using a threshold of 3 standard deviations:
index1 <- phen$logCRP > mean(phen$logCRP) + 3 * sd(phen$logCRP)
index2 <- phen$logCRP < mean(phen$logCRP) - 3 * sd(phen$logCRP)
phen$logCRP[index1 | index2] <- NA
How many outliers were there?
table(is.na(phen$logCRP))
##
## FALSE TRUE
## 8207 30
Let’s plot this distribution
hist(phen$logCRP, prob=TRUE, density=20, breaks=100, main="Histogram of logCRP and expected normal curve", xlab="logCRP")
curve(
dnorm(x, mean=mean(phen$logCRP, na.rm=TRUE), sd=sd(phen$logCRP, na.rm=TRUE)),
col="darkblue",
lwd=2,
add=TRUE,
yaxt="n"
)
We’ve been using this same bit of code a lot. To be more efficient in future, we should use functions
plot_distribution <- function(y, main_title="Histogram and expected normal curve", x_title="Values")
{
hist(y, prob=TRUE, density=20, breaks=100, main=main_title, xlab=x_title)
curve(
dnorm(x, mean=mean(y, na.rm=TRUE), sd=sd(y, na.rm=TRUE)),
col="darkblue",
lwd=2,
add=TRUE,
yaxt="n"
)
}
Here we have defined our own function called plot_distribution. We can now pass any vector of values to it and it will do the same operation
plot_distribution(phen$logCRP, main_title="logCRP")
plot_distribution(phen$BMI, main_title="BMI")
plot_distribution(phen$DBP, main_title="DBP")
Is CRP associated with BMI?
plot(BMI ~ logCRP, phen)
It looks like there is a positive relationship, let’s draw a line of best fit. The regression line can be calculated using this formula:
slope <- cov(phen$BMI, phen$logCRP, use="pair") /
var(phen$logCRP, na.rm=TRUE)
intercept <- mean(phen$BMI, na.rm=TRUE) -
slope * mean(phen$logCRP, na.rm=TRUE)
plot(BMI ~ logCRP, phen)
abline(a = intercept, b = slope, col="darkred", lwd=3)
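A quick sanity check of the covariance/variance formula, on a tiny example where the answer is known by construction:

```r
# y = 2x exactly, so the slope must be 2 and the intercept 0
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

slope     <- cov(y, x) / var(x)
intercept <- mean(y) - slope * mean(x)
c(slope = slope, intercept = intercept)   # 2 and 0
```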
Alternatively, we can use R’s built in linear regression tools. This is how to calculate the linear regression:
lm(BMI ~ logCRP, data=phen)
##
## Call:
## lm(formula = BMI ~ logCRP, data = phen)
##
## Coefficients:
## (Intercept) logCRP
## 26.999 2.354
We can use this to draw the regression line
plot(BMI ~ logCRP, phen, col="grey")
abline(lm(BMI ~ logCRP, phen), col="blue", lwd=3)
lm is actually quite a powerful tool. It provides diagnostics about the regression
plot(lm(BMI ~ logCRP, phen))
These are used to interpret how influential a particular point is on the model
What happens if we want to estimate the effect of CRP while fitting SBP and DBP as covariates?
lm(BMI ~ logCRP + SBP + DBP, phen)
##
## Call:
## lm(formula = BMI ~ logCRP + SBP + DBP, data = phen)
##
## Coefficients:
## (Intercept) logCRP SBP DBP
## 18.16942 2.14796 0.02662 0.05533
We can get the standard errors of these estimates
summary(lm(BMI ~ logCRP + SBP + DBP, phen))
##
## Call:
## lm(formula = BMI ~ logCRP + SBP + DBP, data = phen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.7578 -3.4689 -0.1244 3.2125 19.2480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.169425 0.394796 46.022 <2e-16 ***
## logCRP 2.147962 0.067345 31.895 <2e-16 ***
## SBP 0.026624 0.002316 11.497 <2e-16 ***
## DBP 0.055329 0.006288 8.799 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.998 on 8202 degrees of freedom
## (31 observations deleted due to missingness)
## Multiple R-squared: 0.1841, Adjusted R-squared: 0.1838
## F-statistic: 616.8 on 3 and 8202 DF, p-value: < 2.2e-16
This is the result from a multiple regression. You can calculate this directly using matrix operations.
The matrix algebra for performing linear regression is:
\[ \hat{\textbf{b}} = (\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y} \]
Let’s make the \(\textbf{X}\) matrix, which needs the variables plus a column of 1s to represent the intercept
X <- as.matrix(
data.frame(
intercept=1,
logCRP=phen$logCRP,
SBP=phen$SBP,
DBP=phen$DBP)
)
Note - handling missing values properly requires more complex methods; for convenience we just set missing values to 0 here, which is why the estimates below differ slightly from those given by lm
X[is.na(X)] <- 0
y <- as.matrix(phen$BMI)
y[is.na(y)] <- 0
Now we can use R’s matrix operations to calculate the values for \(\textbf{b}\)
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## intercept 18.14146924
## logCRP 2.13731058
## SBP 0.02669011
## DBP 0.05572816
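The standard errors reported by summary(lm) come from the same matrix algebra: the variance-covariance matrix of the estimates is \(\hat{\sigma}^2 (\textbf{X}'\textbf{X})^{-1}\). A sketch with simulated, complete data (so no missing-value fudge is needed):

```r
set.seed(3)
n <- 1000
X <- cbind(intercept = 1, x1 = rnorm(n), x2 = rnorm(n))
y <- X %*% c(1, 2, -0.5) + rnorm(n)

XtX_inv <- solve(t(X) %*% X)
b <- XtX_inv %*% t(X) %*% y                # coefficient estimates
resid <- y - X %*% b
sigma2 <- sum(resid^2) / (n - ncol(X))     # residual variance
se <- sqrt(diag(sigma2 * XtX_inv))         # standard errors

cbind(estimate = b, se = se)
```

These match the estimates and standard errors that lm would report for the same data.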
How much variance is explained by logCRP on BMI? We can get this by calculating \(R^2\).
variance_explained <- cor(phen$logCRP, phen$BMI, use="pair")^2
What is the power of detecting an effect of this magnitude for different sample sizes? R has some built in power calculators. For more sophisticated power calculators see the R/pwr package. We’ll just use the basic one here
sample_sizes <- seq(50, 500, by=50)
pwr <- power.anova.test(
n=sample_sizes,
groups=2,
between.var=variance_explained,
within.var=1-variance_explained,
sig.level=0.05
)
Let’s plot the power as a function of sample size
plot(pwr$power ~ pwr$n)
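As a sanity check, power should rise monotonically with sample size for a fixed effect. A sketch using an assumed \(R^2\) of 0.05 in place of the estimate above:

```r
sample_sizes <- seq(50, 500, by = 50)
r2 <- 0.05   # assumed variance explained (illustrative, not the estimate above)

pwr <- power.anova.test(
  n = sample_sizes,
  groups = 2,
  between.var = r2,
  within.var = 1 - r2,
  sig.level = 0.05
)
round(pwr$power, 3)
```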
What is the effect of BMI on hypertension? We could do an approximate odds ratio. Let’s dichotomise BMI to be values above the mean or below the mean. Create a new column in phen, and then fill in the values:
phen$BMI_binary <- NA
phen$BMI_binary[phen$BMI >= mean(phen$BMI, na.rm=TRUE)] <- 1
phen$BMI_binary[phen$BMI < mean(phen$BMI, na.rm=TRUE)] <- 0
What is the proportion of high and low BMI?
table(phen$BMI_binary)
##
## 0 1
## 4299 3937
table is a useful tool. You can also tabulate across multiple variables - for example, what are the counts for all HT and BMI_binary values?
table(phen$HT, phen$BMI_binary)
##
## 0 1
## 1 1342 636
## 2 2957 3301
Use this to calculate the odds ratio between hypertension and high/low BMI, using:
Odds of high BMI in cases / odds of high BMI in controls
tab <- table(phen$HT, phen$BMI_binary)
# Odds of high BMI amongst cases (HT == 2)
odds_cases <- tab[2,2] / tab[2,1]
# Odds of high BMI amongst controls (HT == 1)
odds_controls <- tab[1,2] / tab[1,1]
odds_ratio <- odds_cases / odds_controls
odds_ratio
## [1] 2.355535
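For inference we usually also want a confidence interval. On the log scale, the standard error of the conventional (cross-product) 2x2-table odds ratio is \(\sqrt{1/a + 1/b + 1/c + 1/d}\). A sketch using the cell counts from the table above:

```r
# Cell counts from table(phen$HT, phen$BMI_binary)
a  <- 1342   # HT == 1 (controls), low BMI
b  <- 636    # HT == 1 (controls), high BMI
c_ <- 2957   # HT == 2 (cases),    low BMI
d  <- 3301   # HT == 2 (cases),    high BMI

or <- (d / c_) / (b / a)                    # cross-product odds ratio
se_log_or <- sqrt(1/a + 1/b + 1/c_ + 1/d)   # SE of log(OR)
ci <- exp(log(or) + c(-1.96, 1.96) * se_log_or)
c(OR = or, lower = ci[1], upper = ci[2])
```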
Big effect… What is the effect of a one-unit increase in BMI on the risk of HT? We can use logistic regression to calculate the log of the odds ratio
glm(as.factor(HT) ~ BMI, phen, family="binomial")
##
## Call: glm(formula = as.factor(HT) ~ BMI, family = "binomial", data = phen)
##
## Coefficients:
## (Intercept) BMI
## -1.7334 0.1048
##
## Degrees of Freedom: 8235 Total (i.e. Null); 8234 Residual
## (1 observation deleted due to missingness)
## Null Deviance: 9081
## Residual Deviance: 8640 AIC: 8644
This is a generalised linear model - it uses the logit link function to regress BMI against the binary outcome. We can obtain standard errors using summary
summary(glm(as.factor(HT) ~ BMI, phen, family="binomial"))
##
## Call:
## glm(formula = as.factor(HT) ~ BMI, family = "binomial", data = phen)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5318 0.3482 0.6287 0.7806 1.4409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.733392 0.143735 -12.06 <2e-16 ***
## BMI 0.104826 0.005271 19.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9080.5 on 8235 degrees of freedom
## Residual deviance: 8639.5 on 8234 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 8643.5
##
## Number of Fisher Scoring iterations: 4
Let’s save the coefficients
res <- coefficients(summary(glm(as.factor(HT) ~ BMI, phen, family="binomial")))
res
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7333922 0.14373537 -12.05961 1.726026e-33
## BMI 0.1048261 0.00527093 19.88759 5.211937e-88
The odds ratio for hypertension per one-unit increase in BMI:
exp(res[2,1])
## [1] 1.110517
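The coefficient table also gives what we need for a confidence interval: exponentiate the estimate plus or minus 1.96 standard errors. A sketch with simulated data standing in for BMI and HT (the true slope here is set to 0.1 on the log-odds scale):

```r
set.seed(4)
n <- 5000
bmi <- rnorm(n, mean = 28, sd = 5)

# Simulate a binary outcome with a true log-odds slope of 0.1 per BMI unit
p  <- plogis(-1.7 + 0.1 * bmi)
ht <- rbinom(n, 1, p)

res <- coefficients(summary(glm(ht ~ bmi, family = "binomial")))
est <- res["bmi", "Estimate"]
se  <- res["bmi", "Std. Error"]
exp(est + c(-1.96, 0, 1.96) * se)   # lower CI, odds ratio, upper CI
```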
We can visualise how the link function works by plotting it. E.g. for HT and DBP.
Let’s just use 20 values for clarity
index <- c(which(phen$HT == 1)[1:10], which(phen$HT == 2)[1:10])
Make the plot
plot(I(HT-1) ~ DBP,
phen[index,],
col="darkblue",
ylab="HT",
xlim=c(50,110)
)
mod <- glm(as.factor(HT) ~ DBP,
family=binomial,
phen,
na.action="na.exclude"
)
curve(predict(mod, data.frame(DBP=x), type="resp"), add=TRUE)
points(phen$DBP[index], fitted(mod)[index], pch=10)
R has its own data format. This usually has a suffix of .RData or .rdata or .rda. It is sometimes convenient to use R’s data format because it has preserved the objects and their classes exactly as you have them in your workspace. For example, if you save a table as a CSV, you might read the CSV into R again and a character column could be interpreted as a numeric column.
There is a small amount of genetic data stored in ../data/example_data/geno.RData. Let’s load it in:
load("../data/example_data/geno.RData")
If you look in the workspace you’ll see there is a new object called snps. What are the dimensions of this object?
dim(snps)
## [1] 8237 100
What do the values look like? This shows the first 10 rows and 10 columns
snps[1:10, 1:10]
## rs6686302_A rs1192625_T rs794239_G rs794237_A rs875294_G rs11203258_A
## id1 0 1 1 1 0 0
## id2 0 0 0 0 0 0
## id3 2 0 2 0 2 2
## id4 1 0 1 0 2 1
## id5 0 0 0 0 1 0
## id6 1 0 1 0 1 1
## id7 0 0 0 0 1 0
## id8 0 0 0 0 1 0
## id9 1 0 1 0 1 1
## id10 0 0 0 0 1 0
## rs12028681_C rs16861446_T rs4532860_T rs1316880_A
## id1 1 0 1 0
## id2 0 0 0 0
## id3 0 0 0 0
## id4 1 0 0 1
## id5 1 0 0 1
## id6 0 0 0 0
## id7 1 0 0 1
## id8 1 0 0 1
## id9 0 0 0 0
## id10 1 0 0 1
The rows represent individuals, and the columns represent the SNPs.
Let’s test if the first SNP is in Hardy-Weinberg equilibrium. To do this we need to calculate the expected genotype counts, then test whether they differ from the observed genotype counts. We’ll use a simple Chi-square test to do this.
Calculate the expected genotype values…
\[ \begin{aligned} p_{AA} & = & p_{A}^2 \\ p_{Aa} & = & 2p_{A}(p_{a}) \\ p_{aa} & = & p_{a}^2 \end{aligned} \]
To do this in R
p_A <- mean(snps[,1], na.rm=TRUE) / 2
p_AA <- p_A^2
p_Aa <- 2 * p_A * (1-p_A)
p_aa <- (1 - p_A)^2
To get the observed values:
table(snps[,1])
##
## 0 1 2
## 4819 2949 469
The Chi-square test (note that because the allele frequency is estimated from the same data, a proper Hardy-Weinberg test has 1 degree of freedom rather than the 2 reported here):
chisq.test(
x = table(snps[,1]),
p = c(p_aa, p_Aa, p_AA)
)
##
## Chi-squared test for given probabilities
##
## data: table(snps[, 1])
## X-squared = 0.40683, df = 2, p-value = 0.8159
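The steps above can be wrapped into a reusable helper and checked on simulated genotypes. hwe_test below is our own function (not a built-in); the simulated genotypes are drawn from exact Hardy-Weinberg proportions, so the test should not reject, while a sample with no heterozygotes should fail badly:

```r
# Our own helper: chi-square Hardy-Weinberg test for a vector of 0/1/2 genotypes
hwe_test <- function(g) {
  p <- mean(g, na.rm = TRUE) / 2                 # frequency of the counted allele
  expected <- c((1 - p)^2, 2 * p * (1 - p), p^2) # HWE genotype proportions
  obs <- table(factor(g, levels = 0:2))          # observed counts, keeping empty classes
  chisq.test(x = obs, p = expected)$p.value
}

set.seed(5)
p <- 0.3
g <- sample(0:2, size = 5000, replace = TRUE,
            prob = c((1 - p)^2, 2 * p * (1 - p), p^2))
hwe_test(g)                                # in HWE: p-value typically large

hwe_test(c(rep(0, 2500), rep(2, 2500)))    # no heterozygotes: tiny p-value
```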
There are R packages (e.g. R/GenABEL and R/LDheatmap) that will calculate the exact LD \(R^2\) or \(D'\) values from genetic data and visualise them. Let’s just do a simple approximation here - using the pairwise correlations between SNPs. This can be done very simply in R
pairwise_cor <- cor(snps, use="pair")^2
dim(pairwise_cor)
## [1] 100 100
pairwise_cor[1:10,1:10]
## rs6686302_A rs1192625_T rs794239_G rs794237_A rs875294_G
## rs6686302_A 1.00000000 3.446180e-02 0.523798704 0.018177963 2.783397e-01
## rs1192625_T 0.03446180 1.000000e+00 0.213228549 0.523168918 2.096868e-05
## rs794239_G 0.52379870 2.132285e-01 1.000000000 0.113161888 2.851866e-01
## rs794237_A 0.01817796 5.231689e-01 0.113161888 1.000000000 4.071972e-02
## rs875294_G 0.27833970 2.096868e-05 0.285186589 0.040719725 1.000000e+00
## rs11203258_A 0.68169997 1.677872e-02 0.778598069 0.019095046 4.384997e-01
## rs12028681_C 0.09914235 2.473231e-01 0.002301966 0.134546335 2.283325e-01
## rs16861446_T 0.01446345 4.270372e-01 0.091414952 0.001827436 4.595640e-02
## rs4532860_T 0.01344482 3.015810e-01 0.063042662 0.607214541 3.805433e-03
## rs1316880_A 0.05654193 1.919728e-02 0.082634547 0.008873524 2.496794e-01
## rs11203258_A rs12028681_C rs16861446_T rs4532860_T rs1316880_A
## rs6686302_A 0.68169997 0.099142347 0.014463446 0.013444817 0.056541929
## rs1192625_T 0.01677872 0.247323141 0.427037225 0.301580972 0.019197281
## rs794239_G 0.77859807 0.002301966 0.091414952 0.063042662 0.082634547
## rs794237_A 0.01909505 0.134546335 0.001827436 0.607214541 0.008873524
## rs875294_G 0.43849968 0.228332450 0.045956396 0.003805433 0.249679425
## rs11203258_A 1.00000000 0.017580897 0.116133880 0.015186220 0.066551520
## rs12028681_C 0.01758090 1.000000000 0.102357662 0.195017486 0.513059606
## rs16861446_T 0.11613388 0.102357662 1.000000000 0.002936625 0.009921112
## rs4532860_T 0.01518622 0.195017486 0.002936625 1.000000000 0.014212185
## rs1316880_A 0.06655152 0.513059606 0.009921112 0.014212185 1.000000000
Notice the matrix is symmetrical - the diagonal values are 1 (the correlation of a variable with itself is 1), and the off-diagonals are the squared pairwise correlations.
We can visualise this using a heatmap in R:
heatmap(pairwise_cor, Rowv=NA, Colv=NA, col = terrain.colors(256), scale="none", margins=c(5,10))
How many LD blocks do you see?